Log compaction failure error and delete temporarily blocks from disk #2261

pracucci · 2022-06-28T13:16:23Z

What this PR does

I've seen a compactor failing to compact some blocks, but the reason is missing from logs because we're not logging the actual error (the error returned by runCompactionJob() is never logged).

Once a compaction fails, its source blocks are left on disk for later investigation. While this is nice to have, it triggers another issue: subsequent jobs may run out of disk space, because the blocks of previously failed compaction jobs are still on disk. This is also happening on the cluster I'm investigating.

In this PR I propose to fix both.

Which issue(s) this PR fixes or relates to

N/A

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pkg/compactor/bucket_compactor.go

replay · 2022-06-28T16:21:56Z

I agree that we shouldn't keep the source blocks of failed compactions forever. But if they always get deleted, it becomes impossible to analyze them. Isn't it possible that in the future we'll miss having them, because some compaction failed and we don't know why?

Would it make sense to add a runtime config flag which can make the compactor optionally not delete the source blocks of failed compactions of a specific tenant? That way if we do need to investigate some failing compactions we could just enable that flag to make the compactor not delete the source blocks of failed compactions of this tenant.

I'm not sure anymore, but I thought in the past we have encountered situations where the ability to analyze the source blocks of failed compactions has been useful to understand an issue, or am I misremembering that?

pstibrany · 2022-06-28T16:28:12Z

> I'm not sure anymore, but I thought in the past we have encountered situations where the ability to analyze the source blocks of failed compactions has been useful to understand an issue, or am I misremembering that?

Source blocks are still available in the bucket.

pracucci · 2022-06-28T16:32:11Z

> > I'm not sure anymore, but I thought in the past we have encountered situations where the ability to analyze the source blocks of failed compactions has been useful to understand an issue, or am I misremembering that?
>
> Source blocks are still available in the bucket.

As @pstibrany mentioned, you can download the source blocks from object storage at any time. I think having the blocks on the container disk is not very useful anyway. To debug them you have to download them to a workstation where you have all your debugging tooling, so why not download them directly from object storage (which is also faster)?

pstibrany

LGTM, nice catch with unlogged error.

replay

LGTM, thanks!

…rafana#2261)
* Log compaction failure error and delete temporarily blocks from disk
  Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Well, we have to always delete local dir
  Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Fix unit tests
  Signed-off-by: Marco Pracucci <marco@pracucci.com>

Log compaction failure error and delete temporarily blocks from disk

20e760f

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci requested a review from pstibrany June 28, 2022 13:16

pracucci added 2 commits June 28, 2022 15:17

Well, we have to always delete local dir

da1d484

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Fix unit tests

300a9c3

Signed-off-by: Marco Pracucci <marco@pracucci.com>

replay reviewed Jun 28, 2022

pstibrany approved these changes Jun 28, 2022

replay approved these changes Jun 28, 2022

pracucci merged commit e9bba6b into main Jun 29, 2022

pracucci deleted the fix-compactor-on-error branch June 29, 2022 05:29
